Conceptual

Exercise 1: p-values

Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper.

We can conclude that:

  • In the absence of any TV, radio, or newspaper advertising, sales of the product will be non-zero (the intercept is significant).
  • There is a relationship between TV expenditure and sales, and between radio expenditure and sales: the data suggest that the increment in sales per unit increment in TV or radio expenditure is statistically significant and not due to random chance.
  • No statistically significant relationship was found between expenditure on newspaper advertising and sales.

To simplify, the null hypothesis assumes that there is no relationship between sales and expenditure on TV, newspaper, or radio. The p-value associated with each of these predictors is the conditional probability of observing an association at least as strong as the one in the data, assuming the null hypothesis is true. This probability is very small for the intercept, TV, and radio but very large for newspaper, which means we cannot reject the null hypothesis for newspaper but can reject it in favour of the alternative hypothesis for TV, radio, and the intercept.
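The Advertising data behind Table 3.4 is not loaded in this document, so as a hedged sketch the built-in mtcars data stands in; the point is only where these p-values live in R: summary()$coefficients exposes each estimate alongside its Pr(>|t|) value.

```r
# Sketch only: the built-in mtcars data stands in for the Advertising data here.
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# The "Pr(>|t|)" column holds the p-value for H0: beta_j = 0
coefs <- summary(fit)$coefficients
print(coefs[, "Pr(>|t|)"])

# Predictors whose null hypothesis we can reject at the 5% level
print(coefs[, "Pr(>|t|)"] < 0.05)
```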

Exercise 2: KNN Classifier vs Regressor

Carefully explain the differences between the KNN classifier and KNN regression methods.

Both are non-parametric, instance-based approaches for assigning a response y0^ to a data point x0 based on the responses y1, ..., yN0 of its N0 nearest neighbors.

However, the nature of the response and the method for deriving it differ. The KNN regression method considers the N0 nearest data points to x0 and predicts y0^ = (1/N0) × Σ_{i=1..N0} yi, the average of the neighbors' responses. The response is continuous.

The KNN classifier considers the N0 nearest data points to x0 and predicts that x0 belongs to the most frequently occurring (modal) class among those neighbors. More specifically, it estimates the probability that x0 belongs to a particular class from the frequency of that class among the N0 neighbors, and assigns the class with the highest probability. The response is categorical or discrete.
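The two variants can be sketched in a few lines of base R; knn_predict and the toy matrix X below are hypothetical names for illustration, not from any package. The same neighbor search feeds either a mean (regression) or a mode (classification).

```r
# Minimal sketch: one neighbor search, two kinds of response.
knn_predict <- function(X, y, x0, k) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))
  neighbors <- y[order(d)[1:k]]
  if (is.numeric(neighbors)) {
    mean(neighbors)                     # regression: average of neighbor responses
  } else {
    names(which.max(table(neighbors)))  # classification: modal neighbor class
  }
}

X <- matrix(c(1, 1,
              2, 2,
              8, 8,
              9, 9), ncol = 2, byrow = TRUE)
print(knn_predict(X, c(1.0, 1.2, 8.0, 8.4), c(1.5, 1.5), k = 2))            # 1.1
print(knn_predict(X, factor(c("a", "a", "b", "b")), c(1.5, 1.5), k = 2))    # "a"
```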

Exercise 3: Students and Salaries

Suppose we have a dataset with five predictors: X1 = GPA, X2 = IQ, X3 = Level (1 for College, 0 for High School), X4 = Interaction between GPA and IQ, X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get β0^ = 50, β1^ = 20, β2^ = 0.07, β3^ = 35, β4^ = 0.01, β5^ = -10.

Part(a)

Which answer is correct and why?

  1. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.
  2. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.
  3. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.
  4. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

The regression equation is

y^ = β0^ + β1^×GPA + β2^×IQ + β3^×Level + β4^×GPA×IQ + β5^×GPA×Level
   = β0^ + (β1^ + β4^×IQ + β5^×Level)×GPA + β2^×IQ + β3^×Level
   = 50 + (20 + 0.01×IQ - 10×Level)×GPA + 0.07×IQ + 35×Level

For the same values of IQ and GPA, the equation changes as follows with Level:

ySchool^ = 50 + (20 + 0.01×IQ)×GPA + 0.07×IQ
yCollege^ = 85 + (10 + 0.01×IQ)×GPA + 0.07×IQ

iq <- seq(80, 130, 0.1)         
gpa <- seq(2.0, 4.0, 0.1)
iq.gpa.df <- data.frame(
  iq = rep(iq, each = length(gpa)), 
  gpa = rep(gpa, length(iq))
) 
# iq.gpa.df %>% dplyr::group_by(iq) %>% dplyr::summarize(min_gpa = min(gpa), max_gpa = max(gpa), total = n())

iq.gpa.df <- iq.gpa.df %>% 
  dplyr::mutate(
    earnings_school = 50 + (20 + 0.01 * iq) * gpa + 0.07 * iq,
    earnings_college = 85 + (10 + 0.01 * iq) * gpa + 0.07 * iq
  )

iq.gpa.df %>% dplyr::filter(earnings_school > earnings_college) %>% 
  dplyr::summarize(min_iq = min(iq), max_iq = max(iq), 
                   min_gpa = min(gpa), max_gpa = max(gpa), 
                   max_diff = max(earnings_school - earnings_college))
##   min_iq max_iq min_gpa max_gpa max_diff
## 1     80    130     3.5       4        5

This elementary grid search shows that if the GPA is high enough, high school graduates can out-earn college graduates, although not by a large amount. Algebraically, ySchool^ - yCollege^ = -35 + 10×GPA, which is positive exactly when GPA > 3.5, matching the grid result. So answer 3 is correct.

Part (b)

Predict the salary of a college graduate with IQ of 110 and GPA of 4.0.

y^ = β0^ + β1^×GPA + β2^×IQ + β3^×Level + β4^×GPA×IQ + β5^×GPA×Level

y^ = 50 + 20×GPA + 0.07×IQ + 35×Level + 0.01×GPA×IQ - 10×GPA×Level

iq <- 110 
gpa <- 4.0
is.college <- 1.0

salary <- 50 + 20 * gpa + 0.07 * iq + 35 * is.college + 0.01 * gpa * iq - 10 * gpa * is.college

print(paste0("Salary for such a student is $", salary * 1000))
## [1] "Salary for such a student is $137100"

Part (c)

True or False: Since the coefficient for GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

False. Significance is not determined by the magnitude of a coefficient but by its p-value (equivalently, by the size of the coefficient relative to its standard error). Even if the coefficient of the GPA/IQ interaction term is very small, in the absence of a p-value we can’t say whether this is evidence for or against an interaction effect.
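A quick simulated illustration (assumed data, not from the book): when a predictor lives on a large scale, its coefficient can be tiny yet overwhelmingly significant, because the standard error is tinier still.

```r
# Assumed illustration: the true slope is tiny (0.002), the predictor is large-scale.
set.seed(1)
x <- runif(1000, 0, 10000)
y <- 0.002 * x + rnorm(1000, sd = 1)

coefs <- summary(lm(y ~ x))$coefficients
# The estimate is tiny, but so is its standard error, so the p-value is essentially 0
print(coefs["x", c("Estimate", "Std. Error", "Pr(>|t|)")])
```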

Exercise 4: Linear and Cubic Regression

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 + β1X + β2X^2 + β3X^3 + ϵ.

Part (a)

Suppose the true relationship between X and Y is linear, i.e. Y = β0 + β1X + ϵ. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

We’d expect the training RSS to be lower for the cubic regression than for the linear regression. Adding second- and third-order polynomial terms increases the model's flexibility (variance) and allows it to fit the training data more closely; in fact, since the linear model is nested within the cubic one, the cubic least-squares fit can never have a higher training RSS.

Part (b)

Answer (a) using test rather than training RSS.

Test RSS will be higher for the cubic regression than for the linear regression, because the true underlying relationship is linear and the cubic regression has overfit to noise in the training data.
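The claims in (a) and (b) can be checked with a small simulation; the setup below (linear truth, 100 training and 100 test points) is an assumption for illustration only.

```r
# Hedged simulation (assumed setup, not from the book): the truth is linear,
# and we compare training vs test RSS for linear and cubic fits.
set.seed(42)
n <- 100
x <- runif(2 * n, -2, 2)
y <- 1 + 2 * x + rnorm(2 * n)  # true relationship is linear
train <- data.frame(x = x[1:n], y = y[1:n])
test  <- data.frame(x = x[(n + 1):(2 * n)], y = y[(n + 1):(2 * n)])

fit.lin <- lm(y ~ x, data = train)
fit.cub <- lm(y ~ poly(x, 3), data = train)

rss <- function(fit, d) sum((d$y - predict(fit, d))^2)

# Training RSS: the cubic fit can never be higher (the linear model is nested in it).
# Test RSS: the extra flexibility usually hurts when the truth is linear.
print(c(train_lin = rss(fit.lin, train), train_cub = rss(fit.cub, train),
        test_lin  = rss(fit.lin, test),  test_cub  = rss(fit.cub, test)))
```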

Part (c)

Suppose the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear and cubic regressions. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

We still expect the training RSS to be lower for the cubic regression than for the linear regression; this holds regardless of the true relationship, because the linear model is nested within the cubic one. The extent to which the errors differ depends on just how non-linear the data are. If the data are very non-linear, the cubic fit will have a much lower training RSS; if only slightly non-linear, it will have only a slightly lower training RSS.

Part (d)

Answer (c) using test RSS rather than training RSS.

There is not enough information to tell. If the true relationship is closer to a cubic polynomial than to a straight line, the cubic regression's test RSS will be lower; if it is only slightly non-linear, the linear model's lower variance may give it the lower test RSS.

Applied

Exercise 8 - Simple Linear Regression

Part(a)

Use the lm function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary function to print the results. Comment on the output.

lm.mpg <- lm(mpg ~ horsepower, data = Auto)
summary(lm.mpg)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Comments

Is there a relationship between the predictor and response? Yes. The p-value associated with the predictor is very small, suggesting that, if there were no relationship between the predictor and response, the probability of observing an association this strong by random chance would be very small.

Furthermore, the F-statistic of this regressor is >> 1, which also confirms that there is an association between the predictor and response.

How strong is the relationship between the predictor and the response?

response.mean <- mean(Auto$mpg)
fit.rse <- summary(lm.mpg)$sigma
print(paste0("Response Mean: ", round(response.mean, 4), " | RSE = ", round(fit.rse, 4), " | Percentage Error = ", round(fit.rse / response.mean * 100, 2)))
## [1] "Response Mean: 23.4459 | RSE = 4.9058 | Percentage Error = 20.92"

The residual standard error is a bit high (about 21% of the mean response), which suggests the relationship is moderate. The R-squared statistic is ~0.61, which means about 61% of the variance in the response is explained by the regression, again evidence of a moderately strong relationship.

Is the relationship between the predictor and response positive or negative?

  • The relationship is negative.
  • For every unit increase in horsepower, mpg decreases by about 0.158 units.

What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

# Confidence and prediction intervals at horsepower = 98
rbind(
  data.frame(
    'data' = 98, 
    'interval' = 'confidence',
    predict(lm.mpg, data.frame(horsepower = 98), interval = 'confidence')
  ), 
  data.frame(
    'data' = 98, 
    'interval' = 'prediction',
    predict(lm.mpg, data.frame(horsepower = 98), interval = 'prediction')
  )
)
##    data   interval      fit      lwr      upr
## 1    98 confidence 24.46708 23.97308 24.96108
## 11   98 prediction 24.46708 14.80940 34.12476

Part(b)

Plot the response and the predictor. Use the abline function to display the least squares regression line.

plot(Auto$horsepower, Auto$mpg, main = 'Auto Dataset - MPG against Horsepower')
abline(lm.mpg$coefficients, col = 'red')

Part(c)

Use the plot function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

par(mfrow = c(2, 2))
plot(lm.mpg)

  • The residuals show evidence of non-linearity, which means we may want to use polynomial regression.
  • The standardized residuals show that the error terms do not have constant variance.
  • Some data points have both a large standardized residual and high leverage, and hence a large Cook's distance; we may want to review these points.

Exercise 9

The question involves the use of multiple linear regression on the Auto dataset.

Part (a)

Produce a scatterplot matrix which includes all of the variables in the dataset.

plot(Auto, col = 'blue', main = "Auto Dataset - Variable pairplot")

Part (b)

Compute the matrix of correlations between the variables using the function cor.

round(cor(Auto[, -9]), 3)
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
## origin        0.565    -0.569       -0.615     -0.455 -0.585        0.213
##                year origin
## mpg           0.581  0.565
## cylinders    -0.346 -0.569
## displacement -0.370 -0.615
## horsepower   -0.416 -0.455
## weight       -0.309 -0.585
## acceleration  0.290  0.213
## year          1.000  0.182
## origin        0.182  1.000

Part (c)

Use the lm function to perform multiple linear regression with mpg as the response and all other variables except name as the predictors. Use summary to print the results. Comment on the output.

# After considering solutions at https://rpubs.com/lmorgan95/ISLR_CH3_Solutions, might be a good idea to encode the origin feature manually. 
# Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
lm.mpg.multi <- lm(mpg ~ . - name, data = Auto)
summary(lm.mpg.multi)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comments

Is there a relationship between the predictors and response?

  • The F-statistic is 252.4, which is >> 1, so there is indeed a relationship between the predictors and the response.
  • The p-value associated with the F-statistic is practically 0, so we can treat the F-statistic as significant.
  • This, in turn, means that we can reject the null hypothesis that the coefficients associated with all the predictors are 0.

Which predictors appear to have a statistically significant relationship with the response?

  • Not all predictors are associated with the response.
  • Specifically, cylinders, horsepower, and acceleration have very large p-values, suggesting no statistically significant association between these variables and the response.

What does the coefficient for the year variable suggest? The coefficient for year is

lm.mpg.multi$coeff['year']
##      year 
## 0.7507727

which suggests that with each additional model year, mpg increases by about 0.75 units, holding the other predictors fixed. In other words, newer cars tend to be more fuel-efficient.

Part (d)

Use plot to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with very high leverage?

par(mfrow = c(2,2))
plot(lm.mpg.multi)

Comments

  • The residuals show evidence of non-linearity in the data. We may have to use polynomial regression by creating higher-order terms for some features.
  • The standardized residuals also show non-linearity, although they lie between -3 and +3.
  • Many data points lie close to the Cook's distance contours, which suggests we do have data points with high leverage.
  • We also have data points with both high leverage and large residuals. These are extracted programmatically below.

lm.mpg.multi.high.leverage <- broom::augment(lm.mpg.multi) %>% 
  dplyr::select(.rownames, .fitted, .hat, .resid, .std.resid, .cooksd) %>% 
  dplyr::arrange(desc(.cooksd))
lm.mpg.multi.high.leverage %>% head(10)
## # A tibble: 10 x 6
##    .rownames .fitted   .hat .resid .std.resid .cooksd
##    <chr>       <dbl>  <dbl>  <dbl>      <dbl>   <dbl>
##  1 14          18.9  0.190   -4.88      -1.63  0.0778
##  2 394         34.5  0.0492   9.54       2.94  0.0558
##  3 327         31.5  0.0287  11.9        3.63  0.0488
##  4 387         28.4  0.0333   9.57       2.92  0.0368
##  5 326         32.9  0.0196  11.4        3.44  0.0296
##  6 245         32.1  0.0195  11.0        3.34  0.0278
##  7 323         33.5  0.0137  13.1        3.95  0.0271
##  8 112         27.6  0.0242  -9.59      -2.92  0.0264
##  9 330         35.0  0.0210   9.64       2.93  0.0230
## 10 45           6.25 0.0382   6.75       2.07  0.0213

Using broom and ggplot2, data points with Cook's distance >= 4/n (a common rule-of-thumb threshold) are identified.

lm.mpg.multi.high.leverage$cooksd_cutoff <- factor(
  ifelse(lm.mpg.multi.high.leverage$.cooksd >= 4 / nrow(Auto), 'True', 'False'
  )
)

ggplot(lm.mpg.multi.high.leverage, aes(x = .hat, y = .std.resid, col = cooksd_cutoff))  +
  geom_point() +
  theme(legend.position = 'bottom') +
  labs(
    title = 'Auto - Residuals vs Leverage',
    x = 'Leverage', y = 'Standardized Residuals', 
    col = 'Cooks Distance >= 4/n'
  )

Part (e)

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.mpg.multi.ixn <- lm(mpg ~ . *., data = Auto[, -9]) # Using all possible interactions between predictors 
summary(lm.mpg.multi.ixn)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Comments

  • F-statistic is 104.2 >> 1, and associated p-value is ~0 so we can continue to reject the null hypothesis that no predictor is associated with the response.
  • However, many predictors now have very large p-values, so we apply the usual threshold of 0.05 for statistical significance.
  • Based on this threshold, the interactions that are most significant are
    • acceleration:origin
    • acceleration:year
    • displacement:year

Part(f) Variable Transformations

Try a few different transformations of the variables, such as log(X), √X, X^2. Report your findings.

# Write a simple function to identify best predictors, as reported on
# https://rpubs.com/lmorgan95/ISLR_CH3_Solutions
best_predictor <- function(dataframe, response) {
  
  if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
    stop("Make sure that all variables are of class numeric/factor!")
  }
  
  # pre-allocate vectors
  varname <- c()
  vartype <- c()
  R2 <- c()
  R2_log <- c()
  R2_quad <- c()
  AIC <- c()
  AIC_log <- c()
  AIC_quad <- c()
  y <- dataframe[ ,response]
  
  # # # # # NUMERIC RESPONSE # # # # #
  if (is.numeric(y)) {
    # Iterate over each column of the dataframe
    for (i in 1:ncol(dataframe)) {
      # Extract the column
      x <- dataframe[ ,i]
      
      # Extract the column name
      varname[i] <- names(dataframe)[i]
      
      # Define the class of the column
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      # Only fit models when the column is not the response itself
      if (!identical(y, x)) {
        # R-squared statistic for the linear: y ~ x
        R2[i] <- summary(lm(y ~ x))$r.squared 
        
        # R-squared statistic for the log-transformation: y ~ log(x)
        if (is.numeric(x)) { 
          # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
          if (min(x) <= 0) { 
            R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
          } else {
            R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
          }
        } else {
          # Don't need to do anything for categorical variables
          R2_log[i] <- NA
        }
        
        # R-squared statistic for the quadratic transformation: y ~ x + x^2
        if (is.numeric(x)) { 
          R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
        } else {
          R2_quad[i] <- NA
        }
        
      } else {
        # If the target and the predictor columns are identical, then populate with NAs
        R2[i] <- NA
        R2_log[i] <- NA
        R2_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               R2 = round(R2, 3), 
               R2_log = round(R2_log, 3), 
               R2_quad = round(R2_quad, 3)) %>%
      mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
      arrange(desc(max_R2))
    
    
    # # # # # CATEGORICAL RESPONSE # # # # #
  } else {
    
    for (i in 1:ncol(dataframe)) {
      # As before, iterate over all columns in the dataframe and extract each column one-by-one
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      # Populate the variable type as before
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      # For all columns except the response
      if (!identical(y, x)) {
        
        # Get the Akaike Information Criterion for the linear model: y ~ x
        # AIC is derived using a generalized linear model
        AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic 
        
        # Get the Akaike Information Criterion for the log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
          } else {
            AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
          }
        } else {
          AIC_log[i] <- NA
        }
        
        # Get the Akaike Information Criterion for the quadratic transformation: y ~ x + x^2
        if (is.numeric(x)) { 
          AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
        } else {
          AIC_quad[i] <- NA
        }
        
      } else {
        AIC[i] <- NA
        AIC_log[i] <- NA
        AIC_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               AIC = round(AIC, 3), 
               AIC_log = round(AIC_log, 3), 
               AIC_quad = round(AIC_quad, 3)) %>%
      mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
      arrange(min_AIC)
  } 
}

Running the best_predictor function over the entire dataset yields these results.

best_predictor(Auto[, -9], "mpg")
## [1] "Response variable: mpg"
##        varname vartype    R2 R2_log R2_quad max_R2
## 1       weight numeric 0.693  0.713   0.715  0.715
## 2 displacement numeric 0.648  0.686   0.689  0.689
## 3   horsepower numeric 0.606  0.668   0.688  0.688
## 4    cylinders numeric 0.605  0.603   0.607  0.607
## 5         year numeric 0.337  0.332   0.368  0.368
## 6       origin numeric 0.319  0.330   0.332  0.332
## 7 acceleration numeric 0.179  0.190   0.194  0.194
## 8          mpg numeric    NA     NA      NA     NA

Visualizing the feature importances of the raw features along with their log-transformed and quadratic versions.

# Get the R2 statistic for raw feature, as well as log transformed and quadratic transformed versions
mpg_predictors <- best_predictor(Auto[, -9], "mpg") %>% 
  dplyr::select(-c(vartype, max_R2)) %>%                   
  tidyr::gather(key = "key", value = "R2", -varname) %>% 
  dplyr::filter(!is.na(R2))
## [1] "Response variable: mpg"
# Map each feature to a level based on their max R2
mpg_predictors_order <- best_predictor(Auto[, -9], "mpg") %>% 
  dplyr::select(varname, max_R2) %>% 
  dplyr::filter(!is.na(max_R2)) %>% 
  dplyr::arrange(desc(max_R2)) %>% 
  dplyr::pull(varname)
## [1] "Response variable: mpg"
# Make the varname in the features df a factor
mpg_predictors[['varname']] <- factor(mpg_predictors$varname, ordered = T, levels = mpg_predictors_order)

ggplot(mpg_predictors, aes(x = R2, y = varname, col = key, group = varname)) +
  geom_line(col = 'grey15') +
  geom_point(size = 2) + 
  theme_light() +
  theme(legend.position = 'bottom') + 
  labs(title = "Best Predictors (& Transformations) for mpg", 
       col = "Predictor Transformation", 
       y = "Predictor")

Based on these R^2 values, we may want to use quadratic transformations of displacement, weight, year, and horsepower.

lm.mpg.multi.quad <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto) 
summary(lm.mpg.multi.quad)
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) + 
##     I(horsepower^2) + I(year^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8607 -1.4222  0.0293  1.3596 11.7816 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.121e+02  6.969e+01   5.912 7.51e-09 ***
## cylinders          5.892e-01  3.154e-01   1.868 0.062514 .  
## displacement      -4.415e-02  1.929e-02  -2.289 0.022606 *  
## horsepower        -1.861e-01  3.928e-02  -4.737 3.06e-06 ***
## weight            -9.982e-03  2.485e-03  -4.016 7.13e-05 ***
## acceleration      -1.837e-01  9.631e-02  -1.907 0.057279 .  
## year              -1.003e+01  1.838e+00  -5.455 8.86e-08 ***
## origin             5.900e-01  2.558e-01   2.307 0.021594 *  
## I(weight^2)        1.052e-06  3.344e-07   3.146 0.001784 ** 
## I(displacement^2)  7.243e-05  3.324e-05   2.179 0.029954 *  
## I(horsepower^2)    4.604e-04  1.331e-04   3.458 0.000605 ***
## I(year^2)          7.090e-02  1.207e-02   5.875 9.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.782 on 380 degrees of freedom
## Multiple R-squared:  0.8765, Adjusted R-squared:  0.873 
## F-statistic: 245.3 on 11 and 380 DF,  p-value: < 2.2e-16

This improves the adjusted R^2 quite significantly, from ~82% to ~87%. However, there is still evidence of non-linearity in the residuals, i.e. a fan shape in the standardized residuals and deviation from the diagonal in the quantile-quantile plot.

par(mfrow = c(2, 2))
plot(lm.mpg.multi.quad)

Trying to resolve this with a log transform of the response.

lm.mpg.multi.log <- lm(log(mpg) ~ . - name, data = Auto)
summary(lm.mpg.multi.log)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40955 -0.06533  0.00079  0.06785  0.33925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.751e+00  1.662e-01  10.533  < 2e-16 ***
## cylinders    -2.795e-02  1.157e-02  -2.415  0.01619 *  
## displacement  6.362e-04  2.690e-04   2.365  0.01852 *  
## horsepower   -1.475e-03  4.935e-04  -2.989  0.00298 ** 
## weight       -2.551e-04  2.334e-05 -10.931  < 2e-16 ***
## acceleration -1.348e-03  3.538e-03  -0.381  0.70339    
## year          2.958e-02  1.824e-03  16.211  < 2e-16 ***
## origin        4.071e-02  9.955e-03   4.089 5.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared:  0.8795, Adjusted R-squared:  0.8773 
## F-statistic: 400.4 on 7 and 384 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm.mpg.multi.log)

The final model makes the following changes:

  • log-transform the response
  • use quadratic terms for features that are related non-linearly to log(mpg)
  • use the most significant interaction terms
  • derive a brand variable from the name of the vehicle

Auto_2 <- Auto %>% mutate(mpg = log(mpg)) 

# Extract the first item from each list element
Auto_2$brand <- sapply(strsplit(as.character(Auto_2$name), split  = " "), function (x){x[1]})

# Fixing typos
Auto_2$brand <- factor(ifelse(Auto_2$brand %in% c("vokswagen", "vw"), "volkswagen", 
                            ifelse(Auto_2$brand == "toyouta", "toyota", 
                                   ifelse(Auto_2$brand %in% c("chevroelt", "chevy"), "chevrolet", 
                                          ifelse(Auto_2$brand == "maxda", "mazda", 
                                                 Auto_2$brand)))))
# Collapse into 10 categories
Auto_2$brand <- forcats::fct_lump(Auto_2$brand, n = 9, other_level = "uncommon")

# Don't need the name feature if we're using brand
Auto_2$name <- NULL

# Fit the model with the best interactions and transformations
lm.mpg.final <- lm(mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + acceleration:origin, data = Auto_2)

summary(lm.mpg.final)
## 
## Call:
## lm(formula = mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + 
##     acceleration:origin, data = Auto_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37893 -0.05907 -0.00022  0.05997  0.34458 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.504e+01  2.651e+00   5.671 2.86e-08 ***
## cylinders           -6.190e-03  1.109e-02  -0.558 0.577147    
## displacement        -2.793e-04  2.723e-04  -1.026 0.305691    
## horsepower          -7.975e-03  1.343e-03  -5.937 6.69e-09 ***
## weight              -1.392e-04  2.537e-05  -5.487 7.60e-08 ***
## acceleration        -1.709e-01  4.509e-02  -3.790 0.000176 ***
## year                -2.726e-01  7.080e-02  -3.851 0.000139 ***
## origin              -1.898e-01  5.847e-02  -3.246 0.001276 ** 
## brandbuick           6.736e-02  3.375e-02   1.996 0.046691 *  
## brandchevrolet       3.399e-02  2.585e-02   1.315 0.189367    
## branddatsun          1.802e-01  3.959e-02   4.551 7.23e-06 ***
## branddodge           5.784e-02  2.957e-02   1.956 0.051239 .  
## brandford           -5.201e-03  2.592e-02  -0.201 0.841093    
## brandplymouth        6.287e-02  2.854e-02   2.203 0.028228 *  
## brandtoyota          1.039e-01  3.863e-02   2.691 0.007454 ** 
## brandvolkswagen      7.267e-02  3.448e-02   2.108 0.035729 *  
## branduncommon        8.069e-02  2.589e-02   3.117 0.001971 ** 
## I(horsepower^2)      1.752e-05  4.231e-06   4.141 4.29e-05 ***
## I(year^2)            1.788e-03  4.789e-04   3.733 0.000219 ***
## acceleration:year    1.828e-03  5.963e-04   3.065 0.002335 ** 
## acceleration:origin  1.116e-02  3.488e-03   3.201 0.001490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1051 on 371 degrees of freedom
## Multiple R-squared:  0.9093, Adjusted R-squared:  0.9044 
## F-statistic:   186 on 20 and 371 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm.mpg.final)

Exercise 10: Carseats (Multiple Linear Regression)

This question should be answered using the Carseats data set.

Part(a)

Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.sales <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.sales)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Part(b)

Provide an interpretation of each coefficient in the model.

  • Intercept (13.04): The baseline level of sales. Holding price constant, a non-urban store outside the US is expected to sell about 13,043 units (Sales is recorded in thousands of units).
  • Price (-0.0545): A one-dollar increase in price is associated with a decrease of roughly 54 units sold, all else held constant.
  • UrbanYes (-0.0219): A store in an urban location is expected to sell about 22 fewer units on average. However, the p-value associated with this effect is very high, so the effect is not statistically significant.
  • USYes (1.2006): Quantifies the increase in sales if a store is located in the US. A US store is expected to sell about 1,201 additional units.

Part(c)

Write out the model in equation form.

Sales^ = 13.0434 − 0.054459 × Price − 0.021916 × Urban + 1.200573 × US
Where:

  • US = 1 if the store is in the US, 0 otherwise
  • Urban = 1 if the store is in an urban location, 0 otherwise
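As a sanity check, the fitted equation can be evaluated by hand and compared against predict(). The example point (a hypothetical urban US store with Price = 100) is arbitrary; this sketch assumes lm.sales from part (a) is in scope.

```r
# Evaluate the fitted equation manually, encoding the dummy variables as 1
b <- coef(lm.sales)
manual <- b["(Intercept)"] + b["Price"] * 100 + b["UrbanYes"] * 1 + b["USYes"] * 1

# Should agree with predict() on the same observation
auto <- predict(lm.sales, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))
all.equal(unname(manual), unname(auto))
```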

Part(d)

For which of the predictors can you reject the null hypothesis H0: βj = 0?

  • We can reject the null hypothesis for the Price and US predictors because their p-values are very small.
  • We cannot reject the null hypothesis for the Urban predictor because there does not seem to be a statistically significant relationship between this variable and the response.

Part(e)

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.sales.smaller <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.sales.smaller)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Part(f)

How well do the models in (a) and (e) fit the data?

  • Model (a): lm.sales has an adjusted R-squared of 0.2335 and an RSE of 2.472.
  • Model (e): lm.sales.smaller has an adjusted R-squared of 0.2354 and an RSE of 2.469.

Model (a) has a higher RSE than model (e) because RSE depends on RSS / (n − p − 1), as does adjusted R². Adding the Urban predictor increases p in the denominator, and the resulting decrease in RSS is too small to offset the lost degree of freedom. RSS always decreases with the addition of a new variable, but adjusted R² (and RSE) need not improve.
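The adjusted R² values reported above can be reproduced directly from RSS and TSS; a small helper (assuming both fitted models are in scope) makes the trade-off explicit:

```r
# Adjusted R^2 = 1 - [RSS / (n - p - 1)] / [TSS / (n - 1)]
adj_r2 <- function(fit) {
  y   <- model.response(model.frame(fit))
  rss <- sum(residuals(fit)^2)
  tss <- sum((y - mean(y))^2)
  n   <- length(y)
  p   <- length(coef(fit)) - 1  # number of predictors, excluding the intercept
  1 - (rss / (n - p - 1)) / (tss / (n - 1))
}

adj_r2(lm.sales)          # ~0.2335
adj_r2(lm.sales.smaller)  # ~0.2354, higher despite a slightly larger RSS
```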

Part (g)

Using the model from (e), obtain the 95% confidence intervals for the coefficients.

The 95% confidence interval for each coefficient is defined as follows:

β̂ᵢ ± t₍n−p−1₎(0.975) × SE(β̂ᵢ)

I.e. the estimate plus or minus its standard error scaled by the t-quantile corresponding to 97.5% probability with n − p − 1 degrees of freedom, n being the number of data points and p the number of predictors (here df = 400 − 3 = 397).

In R, this is computed as follows:

confint(lm.sales.smaller, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
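The confint() output can be reproduced by hand from the coefficient table using the residual degrees of freedom; this sketch assumes lm.sales.smaller is in scope.

```r
# Manual 95% CI: estimate ± t-quantile × standard error
cf <- coef(summary(lm.sales.smaller))
t_crit <- qt(0.975, df = df.residual(lm.sales.smaller))  # df = 397 here

cbind(lower = cf[, "Estimate"] - t_crit * cf[, "Std. Error"],
      upper = cf[, "Estimate"] + t_crit * cf[, "Std. Error"])
```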

Part (h)

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2, 2))
plot(lm.sales.smaller)

The initial diagnostic plot shows there are many data points beyond Cook's distance in the Residuals vs Leverage plot.

Exploring these in more detail.

# The average leverage is always (p + 1) / n, so points well above this threshold have high leverage
# hatvalues() returns the leverage for each data point
round(((2 + 1) / nrow(Carseats)), 10) == round(mean(hatvalues(lm.sales.smaller)), 10)
## [1] TRUE

Identify any potential outliers as data points whose standardized residual is outside [-2, 2].

broom::augment(lm.sales.smaller) %>% 
  dplyr::select(.hat, .std.resid) %>% 
  ggplot(aes(x = .hat, y = .std.resid)) + 
  geom_point() +
  geom_hline(yintercept = -2, size = 1, color = 'red') + 
  geom_hline(yintercept = 2, size = 1, color = 'red') + 
  geom_vline(xintercept = 3 / nrow(Carseats), size = 1, color = 'blue') +
  labs(x = 'Leverage', y = 'Standardized Residual', title = 'Residuals vs Leverage')

A more quantitative approach is to use Cook’s distance.

broom::augment(lm.sales.smaller) %>%        
  tibble::rownames_to_column("rowid") %>% 
  dplyr::arrange(desc(.cooksd)) %>% 
  dplyr::select(Sales, Price, US, .std.resid, .hat, .cooksd)
## # A tibble: 400 x 6
##    Sales Price US    .std.resid    .hat .cooksd
##    <dbl> <dbl> <fct>      <dbl>   <dbl>   <dbl>
##  1 14.9     82 No          2.58 0.0116   0.0261
##  2 14.4     53 No          1.73 0.0237   0.0243
##  3 10.6    149 No          2.32 0.0126   0.0228
##  4 15.6     72 Yes         2.17 0.0129   0.0205
##  5  0.37   191 Yes        -1.42 0.0286   0.0198
##  6 16.3     92 Yes         2.87 0.00664  0.0183
##  7  3.47    81 No         -2.10 0.0119   0.0177
##  8  0.53   159 Yes        -2.05 0.0119   0.0169
##  9 13.6     89 No          2.18 0.00983  0.0158
## 10  3.02    90 Yes        -2.56 0.00710  0.0157
## # … with 390 more rows

Exercise 11: Generated Data (No Intercept)

In this problem, we will investigate the t-statistic for the null hypothesis H0:β=0 in a simple linear regression model without an intercept. To begin, we generate a predictor x and a response y as follows.

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

Part(a)

Predict y using x without any intercept. Report the coefficient estimate βa, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H0:βa=0. Comment on these results.

lm.no.intercept.01 <- lm(y ~ x + 0)
summary(lm.no.intercept.01)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

Comments

  • Coefficient = 1.9939
  • Standard Error = 0.1065
  • t-statistic = 18.73
  • p-value < 2 × 10⁻¹⁶

These statistics suggest that there is strong evidence of a relationship between the predictor and the response. This, in turn, means that the least squares line is of the form

ŷᵢ = β̂ₐ × xᵢ = 1.9939 × xᵢ

Part (b)

Perform a simple linear regression of x onto y without an intercept. Report the same information as in the previous question.

lm.no.intercept.02 <- lm(x ~ y + 0)
summary(lm.no.intercept.02)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

Comments

  • Coefficient: 0.3911
  • Standard Error = 0.02089
  • t-statistic = 18.73
  • p-value < 2 × 10⁻¹⁶

This regression analysis also provides significant evidence against the null hypothesis: there is strong evidence of a relationship between y and x.

Part(c)

What is the relationship between the results obtained in (a) and (b)?

This link has some great analysis, which I will replicate here to improve my own understanding.

Both models have the same t-statistic associated with their coefficients, but the values of the coefficient estimates and their standard errors are different.

This is because in the first model, we are regressing y onto x and in the second model we are regressing x onto y.

In the case of the first model, the underlying relationship is y = 2x + ϵ. Algebraically reversing this gives us x = (y − ϵ)/2. We would therefore expect the coefficient to be 0.5, but the regression gives us an estimate of 0.391, which isn't that close to 0.5.

This is because there are different levels of noise in the data, and they have different effects on the linear regression estimates.

When we regress y onto x, RSS = Σᵢ (yᵢ − ŷᵢ)². Minimizing this vertical distance yields a slightly flatter line.

data <- data.frame(x, y)

ggplot(data, aes(x, y)) + 
  geom_point(color = 'red') + 
  geom_segment(aes(x = x, y = y, xend = x, yend = lm.no.intercept.01$fitted.values)) + 
  geom_abline(intercept = 0, slope = coef(lm.no.intercept.01), size = 1) +
  labs(title = 'LM1 - Predicting `y` using `x`')

In the case of the second model, we are minimising RSS = Σᵢ (xᵢ − x̂ᵢ)², i.e. the horizontal distances. This gives us a line with a steeper slope.

ggplot(data, aes(x, y)) + 
  geom_point(color="red") + 
  geom_segment(aes(x = x, y = y,
                   xend = lm.no.intercept.02$fitted.values, yend = y)) + 
  geom_abline(intercept = 0, slope = 1 / coef(lm.no.intercept.02), size = 1) + 
  labs(title = "LM2: predicting x using y")

As the noise in the data decreases, the two lines converge to each other.

ggplot(data, aes(x, y)) + 
  geom_point(color="red") + 
  geom_abline(intercept = 0, slope = coef(lm.no.intercept.01), size = 1, col = "deepskyblue3") + 
  geom_abline(intercept = 0, slope = 1 / coef(lm.no.intercept.02), size = 1, col = "mediumseagreen") +
  labs(title = "LM1 vs LM2")
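The equality of the two t-statistics can also be checked numerically. For a no-intercept fit, the t-statistic can be written in a closed form that is symmetric in x and y, which is why swapping the roles of predictor and response leaves it unchanged. A quick check, assuming x, y, and both fits from above are in scope:

```r
# Closed-form t-statistic for regression through the origin:
# t = sqrt(n - 1) * sum(x*y) / sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
# Note the expression is unchanged if x and y are swapped
n   <- length(x)
sxy <- sum(x * y)
t_manual <- sqrt(n - 1) * sxy / sqrt(sum(x^2) * sum(y^2) - sxy^2)

t_manual  # ~18.73, matching both regressions
summary(lm.no.intercept.01)$coefficients[, "t value"]
summary(lm.no.intercept.02)$coefficients[, "t value"]
```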

Exercise 12: Reverse Regression

Part (a)

Recall that the coefficient estimate βa^ for the linear regression of y onto x without an intercept is given by equation 3.38. Under what circumstance is the coefficient estimate for the regression of x onto y the same as the coefficient estimate for the regression of y onto x?

ŷ = β̂ₐ × x and x̂ = β̂_b × y

β̂ₐ = Σᵢ xᵢyᵢ / Σᵢ xᵢ²   and   β̂_b = Σᵢ xᵢyᵢ / Σᵢ yᵢ²

Equating coefficients, β̂ₐ = β̂_b requires Σᵢ xᵢyᵢ / Σᵢ xᵢ² = Σᵢ xᵢyᵢ / Σᵢ yᵢ², i.e.

Σᵢ₌₁ᴺ xᵢ² = Σᵢ₌₁ᴺ yᵢ²

This means the coefficients from regressing x onto y and y onto x will be equal when the sum of squares of the observed x values equals the sum of squares of the observed y values.

Part (b)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of x onto y is different from the coefficient estimate of y onto x.

# This was actually done in the previous question, but we'll do it again.
set.seed(2)
x <- rnorm(100)
y <- 5 * x + rnorm(100, sd = 2)
data <- data.frame(x, y)

# Fit regressors
lm.y.onto.x <- lm(y ~ x + 0) 
lm.x.onto.y <- lm(x ~ y + 0)

# Confirm sum of squares is different
sum.squares.x <- sum(x ^ 2) 
sum.squares.y <- sum(y ^ 2)
cat(paste0("Sum of Squares of x: \t", round(sum.squares.x, 3), " | Sum of Squares of y:\t", round(sum.squares.y, 3)))
## Sum of Squares of x:     133.352 | Sum of Squares of y:  3591.294
# Get the coefficients for the slopes 
slope.coef.y.onto.x <- coef(lm.y.onto.x)
slope.coef.x.onto.y <- coef(lm.x.onto.y)
cat(paste0("\nSlope of y ~ x:\t", round(slope.coef.y.onto.x, 3), " | Slope of x ~ y:\t", round(slope.coef.x.onto.y, 3)))
## 
## Slope of y ~ x:  4.905 | Slope of x ~ y:  0.182
# Can also plot the lines 
ggplot(data, aes(x , y)) + 
  geom_point(color = 'red') +
  geom_abline(intercept = 0, slope = coef(lm.y.onto.x), size = 1, col = 'deepskyblue3') + 
  geom_abline(intercept = 0, slope = 1 / coef(lm.x.onto.y), size = 1, col = 'mediumseagreen') + 
  labs(title = 'Regression Lines (y ~ x (blue) and x ~ y (green))')

Part (c) Equal Coefficients

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of x ont y is the same as the coefficient estimate for the regression of y onto x.

# When predictor and response variables are equal, this will always be the case 
set.seed(3)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)

lm.y.onto.x <- lm(y ~ x + 0)
lm.x.onto.y <- lm(x ~ y + 0)

# Confirm sum of squares is the same
sum.squares.x <- sum(x ^ 2) 
sum.squares.y <- sum(y ^ 2)
cat(paste0("Sum of Squares of x: \t", round(sum.squares.x, 3), " | Sum of Squares of y:\t", round(sum.squares.y, 3)))
## Sum of Squares of x:     72.566 | Sum of Squares of y:   72.566
# Get the coefficients for the slopes 
slope.coef.y.onto.x <- coef(lm.y.onto.x)
slope.coef.x.onto.y <- coef(lm.x.onto.y)
cat(paste0("\nSlope of y ~ x:\t", round(slope.coef.y.onto.x, 3), " | Slope of x ~ y:\t", round(slope.coef.x.onto.y, 3)))
## 
## Slope of y ~ x:  1 | Slope of x ~ y:  1
# Can also plot the lines 
ggplot(data, aes(x , y)) + 
  geom_point(color = 'red') +
  geom_abline(intercept = 0, slope = coef(lm.y.onto.x), size = 1, col = 'deepskyblue3') + 
  geom_abline(intercept = 0, slope = 1 / coef(lm.x.onto.y), size = 1, col = 'mediumseagreen') + 
  labs(title = 'Regression Lines (y ~ x (blue) and x ~ y (green))')

Exercise 13: Regression with Simulated Data

In this exercise, you will create some simulated data and will fit simple linear regression models to it.

Part (a)

Using the rnorm function, create a vector x containing 100 observations drawn from a N(0,1) distribution.

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)

Part (b)

Using the rnorm function, create a vector eps containing 100 observations drawn from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25.

eps <- rnorm(100, mean = 0, sd = sqrt(0.25)) # variance = sd ^ 2

Part (c)

Using x and eps, generate a vector y according to the model y = −1 + 0.5x + ϵ. What is the length of the vector y? What are the values of β0 and β1 in this linear model?

# Creating target
y <- -1 + 0.5 * x + eps

# Confirming that 100 elements created
length(y) == length(x) & length(y) == length(eps)
## [1] TRUE
# True parameters are B_0 = -1 and B_1 = 0.5

Part (d)

Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

data <- data.frame(x = x, y = y)
ggplot(data, aes(x = x, y = y)) + geom_point()

  • Positive linear relationship, as x increases, y also increases.
  • There is a lot of noise/spread in the data, so I expect RSE to be high for a linear model.
  • Intercept -1 and slope 0.5, as expected.

Part (e)

Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β0^ and β1^ compare to β0 and β1?

lm.simulated <- lm(y ~ x, data = data)
summary(lm.simulated)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The slope and intercept coefficients are very close to the actual values. The deviations are primarily due to the noise in the data. p-values associated with both coefficients are practically 0, so we have no reason to believe these associations are due to chance (as expected).

However, the R2 is low and the RSE is relatively high - this is due to the noise in the data.

Part (f)

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line in the plot with a different color.

g1 <- ggplot(data, aes(x = x, y = y)) + 
  geom_point()

g1 + 
  geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) + 
  geom_abline(aes(intercept = coef(lm.simulated)[1], slope = coef(lm.simulated)[2], col = "Least Squares")) + 
  scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
  theme(legend.position = "bottom") + 
  labs(title = "sd(eps) = 0.5")

Part (g)

Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit?

lm.simulated.quad <- lm(y ~ x + I(x^2), data = data)
summary(lm.simulated.quad)
## 
## Call:
## lm(formula = y ~ x + I(x^2), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
## x            0.50858    0.05399   9.420  2.4e-15 ***
## I(x^2)      -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14

Comments

  • The p-value for the quadratic term is 0.164, which is far above the 0.05 threshold for significance.
  • This suggests there is no statistically significant relationship between the response and the quadratic predictor.
  • This makes sense because the data itself was simulated using a linear equation, and the quadratic predictor is most likely overfitting to noise ϵ.
  • The adjusted R2 is slightly higher with the quadratic term, but this is expected because the residual standard error has also decreased slightly.
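The same conclusion follows from an F-test comparing the nested models. Because only a single term is added, the F statistic equals the square of the quadratic term's t statistic, so the p-values match. This assumes both fits from above are in scope.

```r
# Nested-model comparison: does adding I(x^2) significantly reduce RSS?
anova(lm.simulated, lm.simulated.quad)
# The reported F-test p-value matches the 0.164 p-value on the I(x^2) coefficient
```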

Part (h)

Repeat parts (a) to (f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same.

# Replicating code from above but with a smaller variance - standard linear model
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.1)   # Now 0.1 instead of 0.5
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)

lm.simulated.low.var <- lm(y ~ x)

ggplot(data, aes(x = x, y = y)) + 
  geom_point() + 
  geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) + 
  geom_abline(aes(intercept = coef(lm.simulated.low.var)[1], 
                  slope = coef(lm.simulated.low.var)[2], col = "Least Squares")) + 
  scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
  theme(legend.position = "bottom") + 
  labs(title = "sd(eps) = 0.1")

summary(lm.simulated.low.var)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18768 -0.06138 -0.01395  0.05394  0.23462 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.003769   0.009699  -103.5   <2e-16 ***
## x            0.499894   0.010773    46.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared:  0.9565, Adjusted R-squared:  0.956 
## F-statistic:  2153 on 1 and 98 DF,  p-value: < 2.2e-16

Comments

  • There is less spread in the data.
  • Population and least squares lines are almost identical because the intercept is practically the same, and the slope estimate improves ever so slightly.
  • The adjusted R2 statistic has improved from 0.4672 to 0.956, and RSE has decreased substantially to 0.09628.
  • p-value associated with the coefficients still shows significance.

Part (i)

Repeat parts (a) to (f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same.

set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.9)   # Now 0.9 instead of 0.5
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)

lm.simulated.high.var <- lm(y ~ x)

ggplot(data, aes(x = x, y = y)) + 
  geom_point() + 
  geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) + 
  geom_abline(aes(intercept = coef(lm.simulated.high.var)[1], 
                  slope = coef(lm.simulated.high.var)[2], col = "Least Squares")) + 
  scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
  theme(legend.position = "bottom") + 
  labs(title = "sd(eps) = 0.9")

summary(lm.simulated.high.var)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6892 -0.5524 -0.1255  0.4855  2.1116 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03392    0.08729 -11.845  < 2e-16 ***
## x            0.49905    0.09695   5.147 1.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8665 on 98 degrees of freedom
## Multiple R-squared:  0.2128, Adjusted R-squared:  0.2048 
## F-statistic: 26.49 on 1 and 98 DF,  p-value: 1.362e-06

Comments

  • As expected, adjusted R-squared and RSE have both become worse.
  • This is because of additional noise in the dataset.

Part (j)

What are the confidence intervals for β0, β1 based on the original dataset, the dataset with less noise, and the dataset with more noise?

# For the original model
confint(lm.simulated)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x            0.3925794  0.6063602
# For the model with low variance 
confint(lm.simulated.low.var)
##                  2.5 %     97.5 %
## (Intercept) -1.0230161 -0.9845224
## x            0.4785159  0.5212720
# For the model with high variance 
confint(lm.simulated.high.var)
##                  2.5 %     97.5 %
## (Intercept) -1.2071447 -0.8607020
## x            0.3066429  0.6914484

Comments

  • None of the confidence intervals for β0 or β1 contain zero which further corroborates our assumption that there is a relationship between the predictor and response.
  • Intervals are wider for models fit to data with more variance/noise.
  • Estimates are still very close to their true values.

Exercise 14 - Collinearity

This problem focuses on the collinearity problem.

Part (a)

Write out the form of the linear model. What are the regression coefficients?

set.seed(1)
x1 <- runif(100)                          # First predictor - random numbers
x2 <- 0.5 * x1 + rnorm(100) / 10         # Second predictor - correlated  
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)  # Target is combination of the two

The equation is of the form

y=2+2x1+0.3x2+ϵ

The coefficients are β0=2, β1=2, β2=0.3.

Part (b)

What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

corr.value <- cor(x1, x2)
plot(x = x1, y = x2, main = paste0("Scatterplot - x1 vs x2 | Correlation = ", round(corr.value, 2)))

The correlation between x1 and x2 is 0.84.

Part (c)

Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are the predicted coefficients? How do they relate to the true coefficients? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?

lm.corr.fit <- lm(y ~ x1 + x2)
summary(lm.corr.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

Comments

  • The predicted regression coefficients [β̂0, β̂1, β̂2] are [2.13, 1.44, 1.00] respectively.
  • The true regression coefficients are [β0,β1,β2] = [2, 2, 0.3]
  • We can reject the null hypothesis that the intercept is 0 because the estimate is non-zero and p-value is practically 0.
  • It seems the intercept and coefficients don’t follow the population regression line very well.
  • We can reject the null hypothesis that β1 is zero, but only barely: the p-value (0.0487) is just below the 5% threshold.
  • We cannot reject the null hypothesis that β2 is zero because the p-value associated with the non-zero estimate for β^2 is very high.

Part (d)

Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?

lm.corr.x1 <- lm(y ~ x1)
summary(lm.corr.x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

Comments

  • We can reject the null hypothesis that β1=0 because the p-value associated with x1 is very small.
  • The coefficient for x1 is still very close to the corresponding coefficient in the population regression line.

Part (e)

Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0:β2=0?

lm.corr.x2 <- lm(y ~ x2)
summary(lm.corr.x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Comments

  • We can reject the null hypothesis that β2=0 because the p-value associated with the coefficient is close to 0.

Part (f)

Do the results obtained in (c) - (e) contradict each other? Explain your answer.

  • There is an apparent contradiction in the results in that
    • lm.corr.fit suggests only x1 is a statistically significant predictor for y.
    • lm.corr.x2 suggests that x2 is also statistically significant.
  • This is a consequence of x1 and x2 being correlated.
  • Even though the predictors are different, the second predictor is about 84% correlated with the first, which means when x1 is high, x2 will also be high, and the association between x2 and y is primarily a consequence of x1.
  • If x2 is statistically significant as a predictor when considered in isolation but not when considered in conjunction with x1, this means x2 is not providing new information to the model.
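The collinearity here can also be quantified with a variance inflation factor. With only two predictors, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the other (equivalently 1 / (1 − cor(x1, x2)²)). A sketch, assuming x1 and x2 from above are in scope:

```r
# VIF without extra packages: regress one predictor on the other
r2  <- summary(lm(x1 ~ x2))$r.squared
vif <- 1 / (1 - r2)  # identical for both predictors in the two-variable case
vif                  # ~3.3, given cor(x1, x2) ≈ 0.84
```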

Part (g)

Now suppose we obtain one additional observation, which was unfortunately mismeasured, as follows.

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

Refit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier, a high-leverage point, or both? Explain your answers.

lm.corr.x1.x2 <- lm(y ~ x1 + x2)
lm.corr.x1 <- lm(y ~ x1)
lm.corr.x2 <- lm(y ~ x2)

# For easier processing, combine the data into a dataframe
data <- data.frame(y, x1, x2, new = 0)
data$new[length(data$new)] <- 1

get_summary_and_leverage <- function(lm_fit, title_str){
  # First, print the coefficients summary
  message(paste0("LM Summary - ", title_str)) 
  print(summary(lm_fit))
  
  # Then plot the leverage
  plt <- broom::augment(lm_fit) %>%
  cbind(new = data$new) %>%
  ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) + 
  geom_point() + 
  geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) + 
  geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) + 
  geom_vline(xintercept = 3 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
  scale_color_manual(values = c("black", "red")) + 
  theme_light() +
  theme(legend.position = "none") +
  labs(x = "Leverage", 
       y = "Standardized Residual", 
       title = paste0("Residuals vs Leverage: ", title_str))
  
  print(plt)
}

Comments: y ~ x1 + x2

get_summary_and_leverage(lm.corr.x1.x2, title_str = 'y ~ x1 + x2')
## LM Summary - y ~ x1 + x2
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06

  • In the original model, only x1 was statistically significant while x2 was not.
  • With the addition of the new data point, x1 has become statistically insignificant and x2 is now statistically significant.
  • The coefficient of x1 is now 0.5394 and the coefficient of x2 is now 2.5146.
  • The intercept has also been skewed, to 2.2267.
  • The new data point has a high standardized residual as well as high leverage, which explains why the results have changed so significantly.

Comments: y ~ x1

get_summary_and_leverage(lm.corr.x1, title_str = 'y ~ x_1')
## LM Summary - y ~ x_1
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05

  • The intercept and coefficient have changed in magnitude, but both are still significant.
  • The new data point has relatively high leverage and a high standardized residual, so it can be considered both an outlier and a high-leverage point.

Comments: y ~ x2

get_summary_and_leverage(lm.corr.x2, 'y ~ x_2')
## LM Summary - y ~ x_2
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06

  • The intercept has changed, and the coefficient for x2 is much larger.
  • x2 is still significant.
  • The new data point has high leverage but is not an outlier, because its standardized residual is between -2 and +2.

Exercise 15: Boston

This problem involves the Boston data set. We will try to predict the per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
Part (a)

For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertion.

# Read in the Boston dataset 
Boston <- MASS::Boston

# Convert `chas` (Charles River bounding dummy variable) to a factor 
Boston$chas <- factor(Boston$chas)

# Loop over all predictors and collect lm summary statistics for each
R2 <- c()
P_value <- c()
Slope <- c()
y <- Boston$crim

for (i in 1:ncol(Boston)) {
  x <- Boston[ ,i]
  
  if (!identical(y, x)) {
    # linear: y ~ x
    R2[i] <- summary(lm(y ~ x))$r.squared 
    P_value[i] <- summary(lm(y ~ x))$coefficients[2, 4]
    Slope[i] <- summary(lm(y ~ x))$coefficients[2, 1]
  } else {
    R2[i] <- NA
    P_value[i] <- NA
    Slope[i] <- NA
  }
}

crime_preds <- data.frame(varname = names(Boston), 
           R2 = round(R2, 5), 
           P_value = round(P_value, 10), 
           Slope = round(Slope, 5)) %>%
  arrange(desc(R2))

crime_preds
##    varname      R2      P_value    Slope
## 1      rad 0.39126 0.0000000000  0.61791
## 2      tax 0.33961 0.0000000000  0.02974
## 3    lstat 0.20759 0.0000000000  0.54880
## 4      nox 0.17722 0.0000000000 31.24853
## 5    indus 0.16531 0.0000000000  0.50978
## 6     medv 0.15078 0.0000000000 -0.36316
## 7    black 0.14827 0.0000000000 -0.03628
## 8      dis 0.14415 0.0000000000 -1.55090
## 9      age 0.12442 0.0000000000  0.10779
## 10 ptratio 0.08407 0.0000000000  1.15198
## 11      rm 0.04807 0.0000006347 -2.68405
## 12      zn 0.04019 0.0000055065 -0.07393
## 13    chas 0.00312 0.2094345015 -1.89278
## 14    crim      NA           NA       NA

Comments

  • Almost all variables have a statistically significant association with crim.
  • The exception is the chas variable.
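
To back this up with plots (as the exercise asks), the strongest and weakest univariate relationships can be visualized directly. A sketch using ggplot2 and the Boston object defined above; the choice of rad (highest R²) and chas (not significant) follows the table:

```r
library(ggplot2)

# rad had the highest R^2 of the simple regressions
p1 <- ggplot(Boston, aes(x = rad, y = crim)) +
  geom_jitter(width = 0.3, alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "crim ~ rad (strong association)")

# chas was the only predictor without a significant association
p2 <- ggplot(Boston, aes(x = chas, y = crim)) +
  geom_boxplot() +
  labs(title = "crim ~ chas (no significant association)")

print(p1)
print(p2)
```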

Part (b)

Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0: βj = 0?

lm.boston.all <- lm(crim ~ ., data = Boston)
summary(lm.boston.all)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas1        -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

Comments

  • The F-statistic (31.47) is large and the associated p-value is practically 0, so we can be confident that at least one predictor is statistically significant, and we can reject the null hypothesis that H0: βj = 0 for all j ∈ {1, …, p}.
  • However, not all predictors are statistically significant. In fact, the majority of predictors have p-values well above 5%.
  • Many predictors that had high statistical significance when considered in isolation do not seem to be significant in a multiple regression context.
  • The significant predictors are: rad, dis, medv, black, zn, and, just above the 5% threshold, nox. For all except nox, we can reject the null hypothesis.
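
One way to cross-check which coefficients survive in the multiple regression is to inspect their 95% confidence intervals; intervals that straddle zero correspond to the predictors for which we cannot reject H0: βj = 0:

```r
# 95% confidence intervals for the multiple regression coefficients.
# Intervals containing 0 (e.g. for indus, chas1, rm, age, tax, ptratio)
# match the predictors with p-values above the 5% threshold.
confint(lm.boston.all, level = 0.95)
```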

Part (c)

How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients on the x-axis and the multivariate regression coefficients on the y-axis.

lm.boston.all.coefs <- summary(lm.boston.all)$coefficients %>% as.data.frame() %>% 
  dplyr::select(Estimate)
lm.boston.all.coefs[['varname']] <- rownames(lm.boston.all.coefs)

lm.boston.single.coefs <- crime_preds %>% dplyr::select(varname, Slope) %>% 
  dplyr::rename('Estimate' = 'Slope')

lm.boston.coefs.merge <- merge(
  lm.boston.all.coefs, lm.boston.single.coefs,
  by = c('varname'), suffixes = c('_multi', '_single')
)

lm.boston.coefs.merge %>% 
  ggplot(aes(x = Estimate_single, y = Estimate_multi, color = varname)) +
  geom_point() +
  labs(x = "Univariate coefficient", y = "Multiple regression coefficient")

Part (d)

Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form

Y=β0+β1X+β2X2+β3X3+ϵ

P_value_x <- c()
P_value_x2 <- c()
P_value_x3 <- c()
R2 <- c()
y <- Boston$crim

for (i in 1:ncol(Boston)) {
  x <- Boston[ ,i]
  
  # Skip the factor `chas` and the response `crim` itself
  if (is.numeric(x) && !identical(y, x)) {
    # cubic: y ~ x + x^2 + x^3
    model <- lm(y ~ x + I(x^2) + I(x^3))
    P_value_x[i] <- summary(model)$coefficients[2, 4]
    P_value_x2[i] <- summary(model)$coefficients[3, 4]
    P_value_x3[i] <- summary(model)$coefficients[4, 4]
    R2[i] <- summary(model)$r.squared 
  } else {
    P_value_x[i] <- NA
    P_value_x2[i] <- NA
    P_value_x3[i] <- NA
    R2[i] <- NA
  }
}

data.frame(varname = names(Boston),
           R2 = round(R2, 5),
           P_value_x = round(P_value_x, 10),
           P_value_x2 = round(P_value_x2, 10), 
           P_value_x3 = round(P_value_x3, 10)) %>%
  filter(!varname %in% c("crim", "chas")) %>%
  arrange(desc(R2)) %>%
  mutate(relationship = case_when(P_value_x3 < 0.05 ~ "Cubic", 
                                  P_value_x2 < 0.05 ~ "Quadratic", 
                                  P_value_x < 0.05 ~ "Linear", 
                                  TRUE ~ "No Relationship"))
##    varname      R2    P_value_x   P_value_x2   P_value_x3    relationship
## 1     medv 0.42020 0.0000000000 0.0000000000 0.0000000000           Cubic
## 2      rad 0.40004 0.6234175212 0.6130098773 0.4823137740 No Relationship
## 3      tax 0.36888 0.1097075249 0.1374681578 0.2438506811 No Relationship
## 4      nox 0.29698 0.0000000000 0.0000000000 0.0000000000           Cubic
## 5      dis 0.27782 0.0000000000 0.0000000000 0.0000000109           Cubic
## 6    indus 0.25966 0.0000529706 0.0000000003 0.0000000000           Cubic
## 7    lstat 0.21793 0.3345299858 0.0645873561 0.1298905873 No Relationship
## 8      age 0.17423 0.1426608270 0.0473773275 0.0066799154           Cubic
## 9    black 0.14984 0.1385871340 0.4741750826 0.5436171817 No Relationship
## 10 ptratio 0.11378 0.0030286627 0.0041195521 0.0063005136           Cubic
## 11      rm 0.06779 0.2117564139 0.3641093853 0.5085751094 No Relationship
## 12      zn 0.05824 0.0026122963 0.0937504996 0.2295386205          Linear

For variables where the P_value_x3 term is below 0.05, there is evidence of a cubic relationship between the predictor and the per capita crime rate.
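
An alternative to spelling out I(x^2) + I(x^3) is poly(), which fits orthogonal polynomial terms, and nested models can be compared directly with an F-test via anova(). A sketch for a single predictor (medv is used only as an example):

```r
fit_linear <- lm(crim ~ medv, data = Boston)
fit_cubic  <- lm(crim ~ poly(medv, 3), data = Boston)

# F-test of the cubic model against the linear one: a small p-value
# indicates the quadratic and cubic terms add significant explanatory power.
anova(fit_linear, fit_cubic)
```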